Influences on the hotel market#

Student names: Prashant Jawalapersad, Jonathan Ogbuli, Jon Hoogervorst, Jacob Jan Woord

Team number: J2

Link to online github repository: https://jacobjanwoord.github.io/Info-Vis_first/docs/home.html

Introduction#

Millions of people book a hotel online every day, whether for a business trip or a relaxing vacation. These bookings play a big role in the success of hotels. As a team from the hotel industry, we want to analyze which factors shape the hotel booking market.

In this data story we look for the factors that shape the hotel booking market and therefore have a great influence on it. Our analysis is based on two main perspectives: a comparison between city hotels and resorts, and the relationship between hotel bookings and the number of air traffic passengers. Through these perspectives we hope to gain better insight into the correlations between the hotel booking market and these factors.

The analysis is based on datasets with several well-correlated variables that provide useful information on the hotel booking market. We use a hotel booking dataset with information on the bookings made by tourists, a passenger flight dataset containing passenger counts, a dataset on the amount of money a tourist spends during a stay, and a country dataset containing economic information on a tourist's home country. The variables we use from these datasets will become clear later in the story.

Perspective 1: During summer-time resorts perform better than city hotels in comparison with other months

The first perspective focuses on performance and compares city hotel statistics with resort statistics. For this we analyze a few aspects of both, such as the average daily rate (ADR), which measures how much guests spend on their booking per day, and the duration of a stay (week nights and weekend nights). Summer-time means the months June, July and August. Analyzing the performance of these two hotel types is interesting for the market, because it lets hotels adjust prices accordingly and identify trends around the summer.
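To make the summer comparison concrete, the following is a minimal sketch on a toy table. The numbers are made up, and the `season` and `summer_ratio` columns are hypothetical helper names that are not part of our preprocessing; only `hotel`, `arrival_date_month` and `adr` come from the hotel booking dataset.

```python
import pandas as pd

# Toy bookings; all values are made up for illustration only.
bookings = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "City Hotel"],
    "arrival_date_month": ["July", "January", "July", "January",
                           "August", "March"],
    "adr": [180.0, 60.0, 120.0, 90.0, 200.0, 100.0],
})

SUMMER_MONTHS = {"June", "July", "August"}

# Flag each booking as summer or other (hypothetical helper column).
bookings["season"] = bookings["arrival_date_month"].isin(SUMMER_MONTHS).map(
    {True: "summer", False: "other"}
)

# Mean ADR per hotel type and season, with one column per season.
adr_by_season = (
    bookings.groupby(["hotel", "season"])["adr"].mean().unstack("season")
)

# A ratio above 1 means the hotel type earns more per night in summer.
adr_by_season["summer_ratio"] = adr_by_season["summer"] / adr_by_season["other"]
print(adr_by_season)
```

Comparing the ratio between the two hotel types, rather than the raw ADR, is one way to express "performs better in summer in comparison with other months".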

Perspective 2: Hotel booking market is determined by the relative increase/decrease of flight passengers compared to previous months

The second perspective is about the relationship between hotel stay durations and flight passenger numbers. We check whether a higher number of air traffic passengers goes together with longer hotel stays (week nights and weekend nights). This matters because the longer a tourist stays, the fewer check-outs the hotel has to handle, which makes managing other bookings easier.

With this data story we analyze and, more importantly, visualize data to gain insight into the hotel booking market.

Dataset and pre-processing#

Our first dataset is about hotel bookings and contains data on over 100,000 bookings. It contains data about when the hotel was booked, how long the stay was, the country of the person booking the hotel and much more. We split the dataset by hotel type, giving one dataset for resorts and one for city hotels. We then grouped each dataset by arrival month and year and aggregated it using the mean function. After this we added a Month_Year column and a Hotel Type column. The Month_Year column makes the plots easier to build, and the Hotel Type column restores information that was lost while aggregating. The resulting datasets are used for perspective 1.
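Because Month_Year labels are text, sorting them directly gives alphabetical rather than chronological order. The sketch below shows one possible way to derive a chronological order by parsing the labels as dates. It uses a made-up toy table, and the `month_start` column is a hypothetical helper, not something our preprocessing creates.

```python
import pandas as pd

# Toy monthly aggregate; the adr values are made up.
monthly = pd.DataFrame({
    "Month_Year": ["January 2016", "August 2015", "December 2015"],
    "adr": [70.0, 110.0, 75.0],
})

# Parse "Month Year" labels into real dates so sorting is chronological.
monthly["month_start"] = pd.to_datetime(monthly["Month_Year"], format="%B %Y")
monthly_sorted = monthly.sort_values("month_start").reset_index(drop=True)

# The sorted labels can then serve as the category order for a plot axis.
month_year_order = monthly_sorted["Month_Year"].tolist()
print(month_year_order)
```

Parsing relies on English month names, which matches the labels in the hotel booking dataset.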

We combined the original hotel booking dataset with the Air traffic passengers dataset, which tracks data such as operating airline, terminal and passenger count. Both the hotel booking dataset and the Air traffic dataset were aggregated by month using the mean function. This allowed us to merge the two datasets on month. This new dataset is used for perspective 2.
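One detail to keep in mind when merging monthly aggregates is row order: `groupby` returns the month names sorted alphabetically, not in calendar order. The following is a small sketch on made-up toy data with only three months; the frame names `hotel`, `air`, `hotel_month`, `air_month` and `merged` are hypothetical.

```python
import pandas as pd

month_names = ["January", "February", "March"]  # shortened for the example

# Toy inputs; all numbers are made up.
hotel = pd.DataFrame({
    "arrival_date_month": ["March", "January", "February", "January"],
    "stays_in_week_nights": [3, 2, 4, 4],
})
air = pd.DataFrame({
    "Month": ["January", "February", "March", "March"],
    "Passenger Count": [100, 200, 300, 500],
})

# Aggregate both tables to one row per month using the mean.
hotel_month = hotel.groupby("arrival_date_month").agg({"stays_in_week_nights": "mean"})
air_month = air.groupby("Month").agg({"Passenger Count": "mean"})

# Both indexes hold month names, so the merge can use the indexes directly.
merged = hotel_month.merge(air_month, left_index=True, right_index=True)

# groupby sorted the keys alphabetically (February, January, March),
# so put the rows back in calendar order before plotting.
merged = merged.reindex(month_names)
print(merged)
```

Without the final `reindex`, plotting these values against a calendar-ordered list of month names would attach values to the wrong months.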

These datasets were found on Kaggle. Here are the links:
Hotel bookings - https://www.kaggle.com/datasets/mojtaba142/hotel-booking
Air traffic passengers - https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics

Preprocessing code#

# imports
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
# get the data from the files.
hotel_data = pd.read_table("hotel_booking.csv", delimiter=";")
air_data = pd.read_table("Air_Traffic_Passenger_Statistics.csv", delimiter=";")

Preprocessing Perspective 1

# split hotel data in resort hotels and city hotels
resort_hotels = hotel_data[hotel_data['hotel'] == 'Resort Hotel']
city_hotels = hotel_data[hotel_data['hotel'] == 'City Hotel']

# group resort hotels based on month and year
grouped_data_resort = resort_hotels.groupby(['arrival_date_month','arrival_date_year'])
resort_aggregate = grouped_data_resort.aggregate({
    'is_canceled': 'mean',
    'lead_time': 'mean',
    'is_repeated_guest': 'mean',
    'previous_cancellations': 'mean',
    'previous_bookings_not_canceled': 'mean',
    'booking_changes': 'mean',
    'stays_in_weekend_nights': 'mean',
    'stays_in_week_nights': 'mean',
    'adults': 'mean',
    'children': 'mean',
    'babies': 'mean',
    'adr': 'mean',
    'required_car_parking_spaces': 'mean'
})

# group city hotels based on month and year
grouped_data_city = city_hotels.groupby(['arrival_date_month','arrival_date_year'])
city_aggregate = grouped_data_city.aggregate({
    'is_canceled': 'mean',
    'lead_time': 'mean',
    'is_repeated_guest': 'mean',
    'previous_cancellations': 'mean',
    'previous_bookings_not_canceled': 'mean',
    'booking_changes': 'mean',
    'stays_in_weekend_nights': 'mean',
    'stays_in_week_nights': 'mean',
    'adults': 'mean',
    'children': 'mean',
    'babies': 'mean',
    'adr': 'mean',
    'required_car_parking_spaces': 'mean'
})

# add Month_Year column to datasets 
resort_aggregate['Month_Year'] = resort_aggregate.index.get_level_values(0) + ' ' + resort_aggregate.index.get_level_values(1).astype(str)
city_aggregate['Month_Year'] = city_aggregate.index.get_level_values(0) + ' ' + city_aggregate.index.get_level_values(1).astype(str)

# order of the months for plots
month_year_order = ['August 2015', 'September 2015', 'October 2015',
                    'November 2015', 'December 2015', 'January 2016',
                    'February 2016', 'March 2016', 'April 2016',
                    'May 2016', 'June 2016', 'July 2016',
                    'August 2016', 'September 2016', 'October 2016',
                    'November 2016', 'December 2016', 'January 2017',
                    'February 2017', 'March 2017', 'April 2017',
                    'May 2017', 'June 2017', 'July 2017',
                    'August 2017']

# order datasets based on Month_Year column
resort_aggregate['Month_Year'] = pd.Categorical(resort_aggregate['Month_Year'], categories=month_year_order, ordered=True)
rdf_sorted = resort_aggregate.sort_values('Month_Year')

city_aggregate['Month_Year'] = pd.Categorical(city_aggregate['Month_Year'], categories=month_year_order, ordered=True)
cdf_sorted = city_aggregate.sort_values('Month_Year')

# add hotel type column to datasets and combine them
resort_aggregate['Hotel Type'] = 'Resort Hotel'
city_aggregate['Hotel Type'] = 'City Hotel'
combined_data = pd.concat([city_aggregate, resort_aggregate])

Preprocessing Perspective 2

# group and aggregate the hotel booking dataset based on month using the mean function
aggregate_hotel_month = hotel_data.groupby("arrival_date_month").aggregate({
    'is_canceled': 'mean',
    'lead_time': 'mean',
    'is_repeated_guest': 'mean',
    'previous_cancellations': 'mean',
    'previous_bookings_not_canceled': 'mean',
    'booking_changes': 'mean',
    'stays_in_weekend_nights': 'mean',
    'stays_in_week_nights': 'mean',
    'adults': 'mean',
    'children': 'mean',
    'babies': 'mean',
    'adr': 'mean',
    'required_car_parking_spaces': 'mean'
})

# group and aggregate the air traffic dataset based on month using the mean function
aggregate_air_month = air_data.groupby("Month").aggregate({
    'Passenger Count': 'mean',
    'Adjusted Passenger Count': 'mean'
})

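# month names in calendar order
month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# merge hotel booking data with air traffic data on the month index;
# groupby sorts the month names alphabetically, so reindex to calendar order
# to keep the rows aligned with month_names in the plots
concat_air = aggregate_hotel_month.merge(aggregate_air_month, left_index=True, right_index=True).reindex(month_names)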

Perspective 1:#

During summer-time resorts perform better than city hotels in comparison with other months

  • Argument 1:
    Per-day expenses are correlated with length of stay. (Visualization 1)

# Visualization 1: length of stay per month of resort and city hotels for weekend and week nights
# colors for plot
colors = px.colors.qualitative.T10

# city hotel weekend stay trace
trace1 = go.Bar(
    x=cdf_sorted['Month_Year'],
    y=cdf_sorted['stays_in_weekend_nights'],
    name='City Hotel (weekend)',
    marker=dict(color=colors[0])
)

# resort hotel weekend stay trace
trace2 = go.Bar(
    x=rdf_sorted['Month_Year'],
    y=rdf_sorted['stays_in_weekend_nights'],
    name='Resort Hotel (weekend)',
    marker=dict(color=colors[1])
)

# city hotel week stay trace
trace3 = go.Bar(
    x=cdf_sorted['Month_Year'],
    y=cdf_sorted['stays_in_week_nights'],
    name='City Hotel (week)',
    marker=dict(color=colors[2])
)

# resort hotel week stay trace
trace4 = go.Bar(
    x=rdf_sorted['Month_Year'],
    y=rdf_sorted['stays_in_week_nights'],
    name='Resort Hotel (week)',
    marker=dict(color=colors[3])
)

# combine traces
data = [trace1, trace2, trace3, trace4]

# layout of the plot
layout = go.Layout(
    title='<b>Visualization 1: Comparing the amount of nights spent at a City and a Resort hotel</b>',
    xaxis=go.layout.XAxis(
        title='<b>Month and Year</b>',
        type='category'
    ),
    yaxis=go.layout.YAxis(
        title='<b>Amount of weekend nights</b>',
    ),
    barmode='group',

    # choice between weekend and week traces
    updatemenus=[ # used ChatGPT to make it interactive
        dict(
            buttons=list([
                dict(
                    args=[{'visible': [True, True, False, False]}, {'yaxis.title': '<b>Amount of weekend nights</b>'}],
                    label='Weekends',
                    method='update'
                ),
                dict(
                    args=[{'visible': [False, False, True, True]}, {'yaxis.title': '<b>Amount of week nights</b>'}],
                    label='Weekdays',
                    method='update'
                )
            ]),
            direction='down',
            showactive=True,
            x=0.08,
            y=1.15
        )
    ]
)

data[2].visible = False
data[3].visible = False

figure = go.Figure(data=data, layout=layout)
figure.show()

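*Figure 1: Average number of nights per booking for city and resort hotels, per arrival month. The dropdown menu switches between weekend nights and week nights. Each bar is the mean over all bookings of that hotel type arriving in that month.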

  • Argument 2:
    Total expenses are negatively correlated with being a repeated guest. (Visualization 2)

# Visualization 2: ADR for city and resort hotel, also per month
# box plot of adr for city and resort hotels
fig = px.box(combined_data, x='Hotel Type', y='adr', title='<b>Visualization 2: Comparison of ADR for a City and a Resort Hotel</b>')
fig.update_yaxes(
    title='<b>Average Daily Rate in euros</b>', secondary_y=False
)
fig.update_xaxes(title='<b>Hotel Type</b>')
fig.show()

# ADR for city hotels per month
trace1 = go.Bar(
    x = cdf_sorted['Month_Year'],
    y = cdf_sorted['adr'],
    name='City Hotel',
    marker=dict(color='rgb(102,194,165)')
)

# ADR for resort hotels per month
trace2 = go.Bar(
    x = rdf_sorted['Month_Year'],
    y = rdf_sorted['adr'],
    name='Resort Hotel',
    marker=dict(color='rgb(255, 141, 98)')
)

# combine traces
data = [trace1, trace2]

# set layout of bar plot
layout = go.Layout(
    title='<b>Visualization 3: Comparison of ADR for a City and a Resort Hotel per month</b>',
    xaxis=go.layout.XAxis(
        title='<b>Month and Year</b>',
        type='category'
    ),
    yaxis=go.layout.YAxis(
        title='<b>Average Daily Rate in euros</b>',
    ),
    barmode='group',
)

go.Figure(data=data, layout=layout).show()

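*Figure 2: Box plot of the average daily rate (ADR) for city and resort hotels. Each data point is the mean ADR of one arrival month for that hotel type.

*Figure 3: Mean ADR per arrival month for city and resort hotels, shown as grouped bars.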

Perspective 2:#

Hotel booking market is determined by the relative increase/decrease of air traffic passenger count compared to previous months
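The "relative increase/decrease compared to previous months" can be computed as a month-over-month percentage change. The sketch below uses made-up passenger numbers; in practice the input would be the monthly mean passenger counts in calendar order. The names `passengers` and `relative_change` are hypothetical.

```python
import pandas as pd

# Toy monthly passenger means in calendar order; numbers are made up.
passengers = pd.Series(
    [100.0, 110.0, 99.0, 120.0],
    index=["January", "February", "March", "April"],
    name="Passenger Count",
)

# Relative change versus the previous month: (current - previous) / previous.
# The first month has no predecessor, so its value is NaN.
relative_change = passengers.pct_change()
print(relative_change)
```

The same call on the stay-length columns would give comparable relative changes, which can then be set side by side with the passenger changes.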

  • Argument 1:
    When the number of air traffic passengers increases, the length of hotel stays also increases, and when the number of air traffic passengers decreases, the length of hotel stays also decreases. So the hotel booking market and the number of air traffic passengers are correlated. (Visualization 4)

# Visualization 4: Passenger Count vs stays_in_week_nights and stays_in_weekend_nights per month
# passenger count bar
trace_pass = go.Bar(
    x = month_names,
    y = concat_air['Passenger Count'],
    name='Passenger Count'
)

# stays in week nights line
trace_week = go.Scatter(
    x = month_names,
    y = concat_air['stays_in_week_nights'],
    name='Stays in week nights', line=dict(width=4), marker=dict(size=12)
)

# stays in weekend nights line
trace_end = go.Scatter(
    x = month_names,
    y = concat_air['stays_in_weekend_nights'],
    name='Stays in weekend nights', line=dict(width=4), marker=dict(size=12, color='orange')
)

# second y, right side
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(trace_pass, secondary_y=False)
fig.add_trace(trace_week, secondary_y=True)
fig.add_trace(trace_end, secondary_y=True)

# add title
fig.update_layout(
    title_text="<b>Visualization 4: Analysis of airline passenger count and length of hotel stay</b>"
)

# x-axis title
fig.update_xaxes(title_text="<b>Month</b>")

# y-axes titles
fig.update_yaxes(title_text="<b>Passenger Count</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Average length of hotel booking</b>", secondary_y=True)

fig.show()

*Figure 4: The x axis of this visualization shows the months. The left y axis shows the average air traffic passenger count (blue bars) and the right y axis shows the number of nights in the stay for week nights (red line) and weekend nights (green line). We see that all three variables go up and down in roughly the same months, so months with a higher passenger count also tend to have longer stays.

Passenger count is the average air traffic passenger count per month. Stays in week nights and stays in weekend nights show how long the hotel bookings are.
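The visual impression that the variables move together can be quantified with a correlation coefficient. This is a minimal sketch on made-up monthly values, not our actual results; the column names match the merged dataset, while `monthly` and `corr` are hypothetical names.

```python
import pandas as pd

# Toy monthly table; all numbers are made up.
monthly = pd.DataFrame({
    "Passenger Count": [100.0, 120.0, 150.0, 170.0, 140.0],
    "stays_in_week_nights": [2.1, 2.3, 2.8, 3.0, 2.6],
    "stays_in_weekend_nights": [0.8, 0.9, 1.1, 1.2, 1.0],
})

# Pearson correlation of passenger count with each stay-length column.
corr = monthly.corr(method="pearson")["Passenger Count"].drop("Passenger Count")
print(corr.round(3))
```

With only twelve monthly averages in the real data, such a coefficient rests on a small sample, and a high value shows that the variables move together rather than that one causes the other.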

  • Argument 2:
    People tend to spend more in months with more air traffic passengers. We see that the average daily rate (ADR), which tells us how much people spend per day on their hotel, is higher in months that have more air traffic passengers. (Visualization 5)

# Visualization 5: passenger count vs ADR per month
# passenger count bar
trace_pass = go.Bar(
    x = month_names,
    y = concat_air['Passenger Count'],
    name='Passenger Count'
)

# adr box plot per month, after removing extreme outliers (adr of 4000 or more)
hotel_data = hotel_data[hotel_data['adr'] < 4000]
trace_adr = go.Box(x=hotel_data["arrival_date_month"], y=hotel_data["adr"], boxpoints=False, name="Average Daily Rate")

# second y, right side
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(trace_pass, secondary_y=False)
fig.add_trace(trace_adr, secondary_y=True)

# add title
fig.update_layout(
    title_text="<b>Visualization 5: Analysis of airline passenger count and ADR</b>"
)

# x-axis title
fig.update_xaxes(title_text="<b>Month</b>")
fig.update_xaxes(categoryorder='array', categoryarray=month_names)


# y-axes titles
fig.update_yaxes(title_text="<b>Passenger count</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>ADR</b>", secondary_y=True)

fig.show()

*Figure 5: This visualization shows that months with a higher passenger count also have a higher ADR. ADR stands for average daily rate and measures how much people spend on their booking per day. So we can conclude that hotel bookings are more expensive in months where the air traffic passenger count is higher. The x axis shows the months. The left y axis shows the passenger count and the right y axis shows the ADR.

  • Argument 3:
    In months where the passenger count is higher, the bookings that arrive in these months have a higher average lead time. A higher lead time is beneficial for the hotel market, because it helps hotels make sure they are fully booked. (Visualization 6)

# Visualization 6: passenger count vs lead time per month
# create scatter plot passenger count and month, color is based on lead time
fig = px.scatter(
    concat_air, 
    y=month_names, 
    x="Passenger Count", color='lead_time', 
    title='<b>Visualization 6: Analysis of Passenger count vs lead time per month</b>'
)

# styling code
fig.update_traces(marker_size=15)
fig.update_layout(
    yaxis_title="<b>Month</b>", 
    xaxis_title="<b>Passenger Count</b>", 
    coloraxis_colorbar=dict(
        title="<b>Lead Time</b>",
        thicknessmode="pixels",
        yanchor="top", y=1,
        ticks="outside",
        dtick=10
    )
)

fig.show()

*Figure 6: This visualization shows that months with a higher passenger count also tend to have a higher average lead time. Lead time is the number of days between the moment a booking is made and the arrival date. The x axis shows the passenger count, the y axis shows the months, and the color of each point shows the average lead time.

Reflection#

During this project we had a moment to reflect on our data story. In this reflection moment we received a lot of feedback from our peers and our teaching assistant that we could use to improve it. The peer feedback made clear which points of our data story were confusing and what we did well. Visually we were doing quite well, but the story did not flow well and the structure was not good either.

The story did not flow well because our perspectives were too broad. Our teaching assistant suggested cutting at least one perspective to make the structure of the story clearer. With this advice in mind we decided to cut two of our perspectives and add a fresh one.

Our peers suggested clarifying the arguments, because they were not very understandable at first sight. We also addressed a few other points in our plots, such as changing the order of variables and changing values.

Work Distribution#

When making this project we tried to divide the work as evenly as possible. At the start of the project we divided the work based on our individual weaknesses, so that everyone could improve and learn new skills. After we got the feedback we decided to change our approach. Because of the amount of work we had to do in a short time frame, we divided the tasks based on each member's strengths instead.

Jonathan took the lead in making perspective 1 and was responsible for making two visualizations with arguments.
Jon also contributed to perspective 1 by making another visualization, and took on the role of task management. He ensured that deadlines were met and that progress stayed on track.
Jacob Jan played a huge role in preprocessing the datasets and refining arguments. He was also responsible for making perspective 2 and uploading the final project to the Jupyter Book GitHub page.
Prashant had a multifaceted role in the team and helped with visualizations and arguments when help was needed. He was also responsible for the written sections of the project, such as the introduction, reflection and work distribution.

Overall, we as a team enjoyed this project and managed to organize and divide the tasks well.
